\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}\]
Regression function
\[r(x) = \E (Y | X = x) = \int y f(y|x) dy\]
We estimate the regression function from a sample of the form
\[(Y_1, X_1), \ldots, (Y_n, X_n) \sim F_{X,Y}\]
\[r(x) = \beta_0 + \beta_1 x\]
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Fitted line
\[\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x\]
Fitted values
\[\hat{Y}_i = \hat{r}(X_i)\]
Residuals
\[\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\]
Residual sum of squares
\[RSS = \sum_{i=1}^n \hat{\epsilon}_i^2\]
The least squares estimates are the values \(\hat{\beta}_0, \hat{\beta}_1\) that minimize the RSS. In closed form,
\[ \begin{aligned} \hat \beta_1 & = \dfrac{ \sum_{i=1}^n Y_iX_i - \bar{Y}_n \sum_{i=1}^nX_i}{ \sum_{i=1}^n X_i^2 - \bar{X}_n\sum_{i=1}^nX_i}\\ \hat \beta_0 & = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n \\ \hat{\sigma}^2 &= \left( \frac{1}{n - 2} \right) \sum_{i=1}^n \hat{\epsilon}_i^2 \end{aligned} \]
In matrix form, the model is
\[\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u}\]
with least squares estimator
\[\hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'Y}\]
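As a quick check, here is a minimal R sketch (simulated data; all names are illustrative) that computes the closed-form estimates and the matrix solution and compares them with lm():

```r
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

# closed-form estimates from the formulas above
b1 <- (sum(y * x) - mean(y) * sum(x)) / (sum(x^2) - mean(x) * sum(x))
b0 <- mean(y) - b1 * mean(x)

# matrix form: (X'X)^{-1} X'Y
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

c(b0, b1)
beta_hat
coef(lm(y ~ x))  # all three should agree
```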
Assume \(\epsilon_i | X_i \sim N(0, \sigma^2)\). Then
\[Y_i | X_i \sim N(\mu_i, \sigma^2)\]
where \(\mu_i = \beta_0 + \beta_1 X_i\).
Likelihood function
\[ \begin{align} \prod_{i=1}^n f(X_i, Y_i) &= \prod_{i=1}^n f_X(X_i) f_{Y | X} (Y_i | X_i) \\ &= \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y | X} (Y_i | X_i) \\ &= \Lagr_1 \times \Lagr_2 \end{align} \]
\[ \begin{align} \Lagr_1 &= \prod_{i=1}^n f_X(X_i) \\ \Lagr_2 &= \prod_{i=1}^n f_{Y | X} (Y_i | X_i) \end{align} \]
Conditional likelihood
\[ \begin{align} \Lagr_2 &\equiv \Lagr(\beta_0, \beta_1, \sigma^2) \\ &= \prod_{i=1}^n f_{Y | X}(Y_i | X_i) \\ &\propto \frac{1}{\sigma} \exp \left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \mu_i)^2 \right\} \end{align} \]
Conditional log-likelihood
\[\lagr(\beta_0, \beta_1, \sigma^2) = -n \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2\]
Maximizing the conditional log-likelihood is equivalent to minimizing the RSS:
\[RSS = \sum_{i=1}^n \hat{\epsilon}_i^2 = \sum_{i=1}^n \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \right)^2\]
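To see this equivalence numerically, a short sketch (reusing the simulated x and y from above) maximizes the conditional log-likelihood with optim() and compares the result with lm():

```r
# negative conditional log-likelihood; par = (beta0, beta1, log(sigma))
negll <- function(par, x, y) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

opt <- optim(c(0, 0, 0), negll, x = x, y = y)
opt$par[1:2]     # MLE of (beta0, beta1)
coef(lm(y ~ x))  # essentially identical
```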
Let \(\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)^T\) denote the vector of least squares estimates. Conditional on \(X^n = (X_1, \ldots, X_n)\),
\[ \begin{align} \E (\hat{\beta} | X^n) &= \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \\ \Var (\hat{\beta} | X^n) &= \frac{\sigma^2}{n s_X^2} \begin{pmatrix} \frac{1}{n} \sum_{i=1}^n X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix} \end{align} \]
where
\[s_X^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2\]
\[ \begin{align} \widehat{\se} (\hat{\beta}_0) &= \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{ \sum_{i=1}^n X_i^2}{n}} \\ \widehat{\se} (\hat{\beta}_1) &= \frac{\hat{\sigma}}{s_X \sqrt{n}} \end{align} \]
Consistency
\[\hat{\beta}_0 \xrightarrow{P} \beta_0 \, \text{and} \, \hat{\beta}_1 \xrightarrow{P} \beta_1\]
Asymptotic normality
\[\frac{\hat{\beta}_0 - \beta_0}{\widehat{\se}(\hat{\beta}_0)} \leadsto N(0,1) \, \text{and} \, \frac{\hat{\beta}_1 - \beta_1}{\widehat{\se}(\hat{\beta}_1)} \leadsto N(0,1)\]
Approximate \(1 - \alpha\) confidence intervals for \(\beta_0\) and \(\beta_1\) are
\[\hat{\beta}_0 \pm z_{\alpha / 2} \widehat{\se}(\hat{\beta}_0) \, \text{and} \, \hat{\beta}_1 \pm z_{\alpha / 2} \widehat{\se}(\hat{\beta}_1)\]
The Wald test for \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\) rejects \(H_0\) if \(|W| > z_{\alpha / 2}\), where
\[W = \frac{\hat{\beta}_1}{\widehat{\se}(\hat{\beta}_1)}\]
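A small sketch of the test, continuing with the simulated fit (names are illustrative):

```r
fit <- lm(y ~ x)
est <- coef(summary(fit))["x", ]

W <- est["Estimate"] / est["Std. Error"]
W
2 * pnorm(-abs(W))  # approximate two-sided p-value; reject H0 if |W| > z_{alpha/2}
```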
The errors have zero conditional mean:
\[\E(\epsilon_i) \equiv \E(\epsilon_i | X_i) = 0\]
\[ \begin{aligned} \mu_i \equiv \E(Y_i) \equiv \E(Y_i | X_i) &= \E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ \mu_i &= \beta_0 + \beta_1 X_i + \E(\epsilon_i) \\ \mu_i &= \beta_0 + \beta_1 X_i + 0 \\ \mu_i &= \beta_0 + \beta_1 X_i \end{aligned} \]
The variance of the errors is the same regardless of the values of \(X\)
\[\Var(\epsilon_i | X_i) = \sigma^2\]
The errors are assumed to be normally distributed
\[\epsilon_i \mid X_i \sim N(0, \sigma^2)\]
Observational study
\[\epsilon_i \sim N(0, \sigma^2), \text{for } i = 1, \dots, n\]
Influence
\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]
Hat statistic
\[h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_{j} - \bar{X})^2}\]
Larger values indicate higher leverage
Standardized residual
\[\hat{\epsilon}_i ' \equiv \frac{\hat{\epsilon}_i}{S_{E} \sqrt{1 - h_i}}\]
Studentized residual
\[\hat{\epsilon}_i^{\ast} \equiv \frac{\hat{\epsilon}_i}{S_{E(-i)} \sqrt{1 - h_i}}\]
\(\text{DFBETA}_{ij}\)
\[D_{ij} = \hat{\beta}_{1j} - \hat{\beta}_{1j(-i)}, \text{for } i=1, \dots, n \text{ and } j = 0, \dots, k\]
\(\text{DFBETAS}_{ij}\)
\[D^{\ast}_{ij} = \frac{D_{ij}}{\widehat{\se}_{(-i)}(\hat{\beta}_{1j})}\]
Cook’s D
\[D_i = \frac{\hat{\epsilon}^{'2}_i}{k + 1} \times \frac{h_i}{1 - h_i}\]
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -12.1 2.54 -4.76 0.00000657
## 2 age 0.219 0.0448 4.88 0.00000401
## 3 tenure -0.0669 0.0643 -1.04 0.300
## 4 unified 0.718 0.458 1.57 0.121
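All of these influence statistics can be computed in base R from an lm fit. A sketch, where nulls_mod is an assumed name for the model summarized above:

```r
h  <- hatvalues(nulls_mod)       # leverage (hat statistic)
rs <- rstandard(nulls_mod)       # standardized residuals
rt <- rstudent(nulls_mod)        # studentized residuals
db <- dfbetas(nulls_mod)         # DFBETAS
cd <- cooks.distance(nulls_mod)  # Cook's D

# cutoffs used below
which(h > 2 * mean(h))                                       # high leverage
which(abs(rt) > 2)                                           # discrepant
which(cd > 4 / (nobs(nulls_mod) - length(coef(nulls_mod))))  # influential
```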
Any observation whose hat value exceeds twice the average \(\bar{h} = \frac{k + 1}{n}\) is noteworthy:
## # A tibble: 9 x 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1st 1 0 49.8 0.800 1 1789 0.0974 0.330 0.00296
## 2 3rd 3 0 52.8 4.20 0 1793 0.113 0.511 0.00841
## 3 12th 12 0 49 6.60 1 1811 0.0802 0.669 0.00980
## 4 17th 17 0 59 16.6 1 1821 0.0887 -0.253 0.00157
## 5 20th 20 0 61.7 17.4 1 1827 0.0790 -0.577 0.00719
## 6 23rd 23 0 64 18.4 1 1833 0.0819 -0.844 0.0159
## 7 34th 34 0 64 14.6 0 1855 0.0782 -0.561 0.00671
## 8 36th 36 0 68.7 17.8 0 1859 0.102 -1.07 0.0326
## 9 99th 99 3 71.9 16.7 0 1985 0.0912 0.295 0.00221

Any studentized residual outside the range \([-2,2]\) is discrepant:
## # A tibble: 7 x 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67th 67 6 66 9 1 1921 0.0361 2.14 0.0415
## 2 74th 74 10 71.1 14.2 1 1935 0.0514 4.42 0.223
## 3 90th 90 6 64.7 13.3 1 1967 0.0195 2.49 0.0292
## 4 91st 91 6 65.1 13 1 1969 0.0189 2.42 0.0269
## 5 92nd 92 5 62 9.20 1 1971 0.0146 2.05 0.0150
## 6 98th 98 7 69.9 14.7 0 1983 0.0730 3.02 0.165
## 7 104th 104 8 60.6 12.5 1 1995 0.0208 4.48 0.0897

Observations with \(D_i > \frac{4}{n - k - 1}\) are influential:
## # A tibble: 4 x 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67th 67 6 66 9 1 1921 0.0361 2.14 0.0415
## 2 74th 74 10 71.1 14.2 1 1935 0.0514 4.42 0.223
## 3 98th 98 7 69.9 14.7 0 1983 0.0730 3.02 0.165
## 4 104th 104 8 60.6 12.5 1 1995 0.0208 4.48 0.0897

## [1] 0.232
## [1] 0.258
## [1] 1.68
## [1] 1.29
\[\epsilon_i | X_i \sim N(0, \sigma^2)\]
## # A tibble: 3,997 x 4
## age sex compositeHourlyWages yearsEducation
## <int> <chr> <dbl> <int>
## 1 40 Male 10.6 15
## 2 19 Male 11 13
## 3 46 Male 17.8 14
## 4 50 Female 14 16
## 5 31 Male 8.2 15
## 6 30 Female 17.0 13
## 7 61 Female 6.7 12
## 8 46 Female 14 14
## 9 43 Male 19.2 18
## 10 17 Male 7.25 11
## # ... with 3,987 more rows
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.12 0.599 -13.6 5.27e- 41
## 2 sexMale 3.47 0.207 16.8 4.04e- 61
## 3 yearsEducation 0.930 0.0343 27.1 5.47e-149
## 4 age 0.261 0.00866 30.2 3.42e-180
## [1] 967 3911
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.10 0.0380 28.9 1.97e-167
## 2 sexMale 0.224 0.0131 17.1 2.16e- 63
## 3 yearsEducation 0.0559 0.00217 25.7 2.95e-135
## 4 age 0.0182 0.000549 33.1 4.50e-212
## [1] 2760 3267
Homoscedasticity
\[\text{Var}(\epsilon_i) = \sigma^2\]
\[\widehat{\se}(\hat{\beta}_{1j}) = \sqrt{\hat{\sigma}^{2} (X'X)^{-1}_{jj}}\]
Heteroscedasticity
To illustrate, simulate a homoscedastic process
\[Y_i = 2 + 3X_i + \epsilon_i, \quad \epsilon_i \sim N(0,1)\]
and a heteroscedastic process whose error variance grows with \(X_i\):
\[Y_i = 2 + 3X_i + \epsilon_i, \quad \epsilon_i \sim N\left(0,\frac{X_i}{2}\right)\]
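Here is a sketch of such a simulation, using the studentized Breusch-Pagan test from the lmtest package (the sample size and the distribution of x are assumptions; the model names match the output below):

```r
library(tidyverse)
library(lmtest)

set.seed(1234)
n <- 1000
x <- runif(n, 0, 10)

# constant error variance
sim_homo <- tibble(x = x, y = 2 + 3 * x + rnorm(n, sd = 1))
sim_homo_mod <- lm(y ~ x, data = sim_homo)

# error variance proportional to x
sim_hetero <- tibble(x = x, y = 2 + 3 * x + rnorm(n, sd = sqrt(x / 2)))
sim_hetero_mod <- lm(y ~ x, data = sim_hetero)

bptest(sim_homo_mod)    # large p-value: no evidence against constant variance
bptest(sim_hetero_mod)  # tiny p-value: constant variance rejected
```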
##
## studentized Breusch-Pagan test
##
## data: sim_homo_mod
## BP = 0.3, df = 1, p-value = 0.6
##
## studentized Breusch-Pagan test
##
## data: sim_hetero_mod
## BP = 200, df = 1, p-value <2e-16
Under heteroscedasticity, \(\epsilon_i \sim N(0, \sigma_i^2)\), and the error covariance matrix is
\[ \begin{bmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \sigma_n^2 \\ \end{bmatrix} \]
We can define the reciprocal of each variance \(\sigma_i^2\) as the weight \(w_i = \frac{1}{\sigma_i^2}\), then let matrix \(\mathbf{W}\) be a diagonal matrix containing these weights:
\[ \mathbf{W} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_2^2} & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_n^2} \\ \end{bmatrix} \]
So rather than following the traditional linear regression estimator
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}\]
we can substitute in the weighting matrix \(\mathbf{W}\):
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} \mathbf{y}\]
The error variance is then estimated by
\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n w_i \hat{\epsilon}_i^2}{n}\]
This is equivalent to minimizing the weighted sum of squares, giving greater weight to observations with smaller variance.
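A sketch of the weighted estimator, where X, y, and w are assumed to be the design matrix, response vector, and weights \(w_i = 1/\hat{\sigma}_i^2\):

```r
# WLS via the matrix formula
W <- diag(w)
beta_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y

# equivalent fit with lm() and its weights argument
coef(lm(y ~ X[, -1], weights = w))  # should match beta_wls
```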
How do we estimate the weights \(w_i\)? One simple approach bases them on the residuals from an initial OLS fit. For example, applying this approach to our original SLID model:
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.12 0.599 -13.6 5.27e- 41
## 2 sexMale 3.47 0.207 16.8 4.04e- 61
## 3 yearsEducation 0.930 0.0343 27.1 5.47e-149
## 4 age 0.261 0.00866 30.2 3.42e-180
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.10 0.0160 -506. 0
## 2 sexMale 3.47 0.00977 356. 0
## 3 yearsEducation 0.927 0.00142 653. 0
## 4 age 0.261 0.000170 1534. 0
We see mild changes in the estimated parameters but drastic reductions in the standard errors. The problem is that these smaller standard errors are potentially biased: because the weights were themselves estimated from the original residuals, the reported covariance matrix does not reflect that additional source of uncertainty. It would be better to model the weights as a function of relevant explanatory variables.1
Alternatively, we could correct for heteroscedasticity only in the standard error estimates. This produces the same estimated parameters but adjusts the standard errors to account for the violation of the constant error variance assumption, so we won't falsely believe our estimates are more precise than they really are. One major estimation procedure is Huber-White standard errors (also called robust standard errors), which can be recovered using the car::hccm() function:2
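A sketch of how those robust standard errors could be recovered, where wage_model is an assumed name for the SLID fit from above:

```r
library(car)     # hccm()
library(broom)   # tidy()
library(dplyr)

# hccm() returns the sandwich variance-covariance matrix;
# the robust standard errors are the square roots of its diagonal
robust_se <- sqrt(diag(hccm(wage_model)))

tidy(wage_model) %>%
  mutate(std.error.rob = robust_se)
```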
## # A tibble: 4 x 6
## term estimate std.error statistic p.value std.error.rob
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.12 0.599 -13.6 5.27e- 41 0.636
## 2 sexMale 3.47 0.207 16.8 4.04e- 61 0.207
## 3 yearsEducation 0.930 0.0343 27.1 5.47e-149 0.0385
## 4 age 0.261 0.00866 30.2 3.42e-180 0.00881
Notice that these new standard errors are a bit larger than the original model, accounting for the increased uncertainty of our parameter estimates due to heteroscedasticity.
Define the partial residual for the \(j\)th explanatory variable:
\[\hat{\epsilon}_i^{(j)} = \hat{\epsilon}_i + \hat{\beta}_j X_{ij}\]
In essence, calculate the least-squares residual (\(\hat{\epsilon}_i\)) and add to it the linear component of the partial relationship between \(Y\) and \(X_j\). Finally, we can plot \(X_j\) versus \(\hat{\epsilon}_i^{(j)}\) and assess the relationship. For instance, consider the results of the logged wage model from earlier:
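A sketch of a partial residual plot for age (slid and wage_model are assumed names for the data and the logged-wage fit, and we assume no rows were dropped during fitting):

```r
library(tidyverse)

slid %>%
  mutate(partial_age = residuals(wage_model) +
           coef(wage_model)["age"] * age) %>%
  ggplot(aes(age, partial_age)) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +                             # nonparametric (GAM) fit
  geom_smooth(method = "lm", se = FALSE, linetype = 2)  # linear fit
```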
The solid lines are GAM fits, while the dashed lines are linear least-squares fits. For age, the partial relationship with logged wages is not linear: some transformation of age is necessary to correct this. For education, the relationship is approximately linear except for the discrepancy among individuals with very low education levels.
We can correct this by adding a quadratic term for age and replacing education with its square. The resulting regression model is:
\[\log(\text{Wage}) = \beta_0 + \beta_1(\text{Male}) + \beta_2 \text{Age} + \beta_3 \text{Age}^2 + \beta_4 \text{Education}^2\]
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.397 0.0578 6.87 7.62e- 12
## 2 sexMale 0.221 0.0124 17.8 3.21e- 68
## 3 I(yearsEducation^2) 0.00181 0.0000786 23.0 1.19e-109
## 4 age 0.0830 0.00319 26.0 2.93e-138
## 5 I(age^2) -0.000852 0.0000410 -20.8 3.85e- 91
Because the model is now nonlinear in both age and education, we need to rethink how to draw the partial residuals plot. The easiest approach is to plot the partial residuals for both age and education against the original explanatory variable. For age, that is
\[\hat{\epsilon}_i^{\text{Age}} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i + \hat{\epsilon}_i\]
and for education,
\[\hat{\epsilon}_i^{\text{Education}} = 0.002 \times \text{Education}^2_i + \hat{\epsilon}_i\]
On the same graph, we also plot the partial fits for the two explanatory variables:
\[\hat{Y}_i^{(\text{Age})} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i\]
and for education,
\[\hat{Y}_i^{(\text{Education})} = 0.002 \times \text{Education}^2_i\]
On the graphs, the solid lines represent the partial fits and the dashed lines a smoothed fit to the partial residuals. If the two lines largely coincide, the revised model does a good job of accounting for the nonlinearity.
Perfect collinearity is incredibly rare, and typically involves including a transformed version of a variable in the model along with the original. For example, let's estimate a regression model explaining mpg as a function of disp, wt, and cyl:
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.403 -1.403 -0.495 1.339 6.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.10768 2.84243 14.46 1.6e-14 ***
## disp 0.00747 0.01184 0.63 0.5332
## wt -3.63568 1.04014 -3.50 0.0016 **
## cyl -1.78494 0.60711 -2.94 0.0065 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.4 on 3 and 28 DF, p-value: 5.4e-11
Now let's say we want to recode disp so it is centered around its mean and re-estimate the model:
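A sketch of the recoding (disp_mean matches the variable name in the output below):

```r
library(dplyr)

mtcars <- mtcars %>%
  mutate(disp_mean = disp - mean(disp))  # perfectly collinear with disp

summary(lm(mpg ~ disp + wt + cyl + disp_mean, data = mtcars))
```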
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl + disp_mean, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.403 -1.403 -0.495 1.339 6.072
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.10768 2.84243 14.46 1.6e-14 ***
## disp 0.00747 0.01184 0.63 0.5332
## wt -3.63568 1.04014 -3.50 0.0016 **
## cyl -1.78494 0.60711 -2.94 0.0065 **
## disp_mean NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.4 on 3 and 28 DF, p-value: 5.4e-11
Oops. What’s the problem? disp and disp_mean are perfectly correlated with each other:
Because they perfectly explain each other, we cannot estimate a linear regression model that contains both variables.3 Fortunately, R automatically drops the second variable so it can estimate the model. Because of this, perfect multicollinearity is rarely problematic in social science.
Instead, consider the credit dataset:
Age and limit are not strongly correlated with one another, so estimating a linear regression model to predict an individual’s balance as a function of age and limit is not a problem:
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -173. 43.8 -3.96 9.01e- 5
## 2 age -2.29 0.672 -3.41 7.23e- 4
## 3 limit 0.173 0.00503 34.5 1.63e-121
But what about using an individual’s credit card rating instead of age? It is likely a good predictor of balance as well:
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -378. 45.3 -8.34 1.21e-15
## 2 limit 0.0245 0.0638 0.384 7.01e- 1
## 3 rating 2.20 0.952 2.31 2.13e- 2
By replacing age with rating, we introduced a problem into our model. The problem is that limit and rating are strongly correlated with one another:
In the regression model, it is difficult to parse out the independent effects of limit and rating on balance because the two variables tend to increase and decrease together. This reduces the precision of our parameter estimates, inflating the standard errors, which is why the standard error for limit is much larger in the second model than in the first.
A correlation or scatterplot matrix would help to reveal any strongly correlated variables:
Here it is very clear that limit and rating are strongly correlated with one another.
Unfortunately, correlation matrices may not be sufficient to detect collinearity when the correlation involves three or more variables (multicollinearity) while not existing between any two pairs of them. Instead, we can calculate the variance inflation factor (VIF), the ratio of the variance of \(\hat{\beta}_{1j}\) when fitting the full model to the variance of \(\hat{\beta}_{1j}\) when its predictor is fit in a model on its own. Equivalently, \(\text{VIF}_j = \frac{1}{1 - R_j^2}\), where \(R_j^2\) is the \(R^2\) from regressing the \(j\)th predictor on the remaining predictors. We can use the car::vif() function in R to calculate this statistic for each coefficient. A good rule of thumb is that a VIF statistic greater than 10 indicates potential multicollinearity in the model. Applied to the credit regression models above:
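A sketch, where age_limit_mod and limit_rating_mod are assumed names for the two fits above:

```r
library(car)

vif(age_limit_mod)     # both near 1: no collinearity problem
vif(limit_rating_mod)  # far above 10: severe multicollinearity
```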
## age limit
## 1.01 1.01
## limit rating
## 160 160
Drop one or more of the collinear variables from the model
This is not a good idea, even if it makes your results “significant”. By omitting the variable, you are completely re-specifying your model in direct contradiction to your theory. If your theory suggests that a variable can be dropped, go ahead. But if not, then don’t do it.
Collect more data: the more observations, the better. More data could at least decrease your standard errors and give you more precise estimates. And if the new observations are "odd" or unusual, they could also reduce the degree of multicollinearity.
If the variables are indicators of the same underlying concept, you can combine them into an index variable. This could be an additive index where you sum up comparable covariates or binary indicators. Alternatively, you could create an index via principal components analysis.
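For instance, a sketch of a principal-components index for the two collinear credit variables (lowercase column names are assumed from the models above):

```r
# first principal component as an index of the collinear variables
pc <- prcomp(credit[, c("limit", "rating")], scale. = TRUE)
credit$credit_index <- pc$x[, 1]
```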
Shrinkage methods involve fitting a model involving all \(p\) predictors and shrinking the estimated coefficients towards zero. This shrinkage reduces variance in the model. When multicollinearity is high, the variance of the estimator \(\hat{\beta}_1\) is also high. By shrinking the estimated coefficient towards zero, we may increase bias in exchange for smaller variance in our estimates.
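As a sketch, ridge regression via the glmnet package (one possible shrinkage method; credit as above):

```r
library(glmnet)

# ridge regression (alpha = 0) shrinks coefficients toward zero
X <- model.matrix(balance ~ limit + rating, data = credit)[, -1]
ridge <- cv.glmnet(X, credit$balance, alpha = 0)
coef(ridge, s = "lambda.min")
```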
1. Omitted here for time purposes. You can find details on this estimation procedure on the internet.
2. This function actually returns the "sandwich" estimator of the variance-covariance matrix, so we need to take the square root of the diagonal of this matrix.
3. Basically, we cannot invert the variance-covariance matrix of \(\mathbf{X}\) because the collinear columns in \(\mathbf{X}\) are perfectly linearly dependent on each other. Because of this, we cannot get parameter estimates or standard errors for the model.